Training a perceptron in a discrete weight space

On-line and batch learning of a perceptron in a discrete weight space, where each weight can take $2L+1$ different values, are examined analytically and numerically. The learning algorithm is based on the training of the continuous perceptron and prediction following the clipped weights. The learning is described by a new set of order parameters, composed of the overlaps between the teacher and the continuous/clipped students. Different scenarios are examined, among them on-line learning with discrete/continuous transfer functions and off-line Hebb learning. In the case of on-line learning, the generalization error of the clipped weights decays asymptotically as $\exp(-K\alpha^2)$ for a binary activation function and as a double exponential, $\exp(-K_1 e^{K_2\alpha})$, for a continuous activation function. Here $\alpha$ is the number of examples divided by $N$, the size of the input vector, $K_1$ and $K_2$ are positive constants, and $K$ is a positive constant proportional to $1/L$. For finite $N$ and $L$, a perfect agreement between the discrete student and the teacher is obtained for $\alpha \propto \sqrt{L\ln(NL)}$. A crossover to the generalization error $\propto 1/\alpha$, which characterizes continuous weights with binary output, is obtained for synaptic depth $L > O(\sqrt{N})$.

I. INTRODUCTION

Neural networks, and the perceptron as the simplest prototype, have become popular as a tool for understanding human learning and as a basis for many applications. We are interested in the learning ability of the perceptron as an archetype for machines that are able to learn. Most of the perceptrons studied until now fall under one of two extreme constraints: either the teacher weight vector is restricted to a binary space (the Ising teacher), or it is continuous and confined to a hypersphere. Only a few aspects of the learning ability of weights confined to a finite number of values have been studied, although this model describes the realistic case of digital computers, where numbers have a finite-depth representation. Furthermore, the applicability of neural networks to biology and to the construction of real devices requires an understanding of the interplay between the depth of the weights and the learning ability of the network. Such systems are the intermediate case, in which the weights are confined to a finite space of $(2L+1)^N$ configurations, where $L$ is an integer and $N$ stands for the input size.

The generalization ability of such networks, in which the synapses have a finite depth, has been analyzed using replica calculations and was found to exhibit interesting nontrivial behavior in the form of a phase transition. The learning process is composed of two phases: one in which the learning ability is very limited and the generalization error is finite, and another in which the generalization error is exactly zero. Perfect learning is attained at a finite $\alpha$, where $\alpha$ is the number of patterns divided by the input size $N$. Nevertheless, replica calculations do not provide a practical algorithm that one may use in order to obtain this learning behavior. In the Ising case, for instance, although a phase transition was predicted, no practical algorithm reproduces this discontinuous behavior.

In contrast to batch learning, where all the examples are used together to achieve perfect learning, on-line learning is a procedure in which an update rule is applied and each learning step uses only the latest of a sequence of examples. Such an algorithm drastically reduces the computational effort compared with batch learning, and no explicit storage of a training set is required. It was shown that there is no updating rule that uses only the discrete vector for updating and results in perfect learning.

In this paper we address the issue of practical learning from a finite-depth teacher. The method we introduce is based on the clipping of a continuous perceptron. Having an artificial continuous weight vector enables smooth learning; clipping it results in a discrete student whose components are close to those of the teacher. This method has been used successfully in the Ising perceptron. The questions that arise from the procedure above are whether learning is possible at all and, if it is, whether it gives better results than learning in a continuous space. It seems very natural that if the depth of the weights is very large, i.e., there are many possible values for each weight, the learning behavior of the discrete weights will be exactly the same as that of continuous weights. In the following we examine whether scaling relations exist between the two parameters, $L$ and $N$, and what they are.

Our main results are: (a) Learning in the case of finite depth is possible by using a continuous precursor. This result was confirmed both analytically and numerically. (b) In the on-line learning scenario, a binary output results in a fast decay of the generalization error; in the large-$\alpha$ regime it decays super-exponentially with $\alpha$, $\epsilon_g \propto \exp(-K\alpha^2)$, where $K$ is some constant. A continuous output results in a much faster decay of the generalization error, $\exp(-K_1 \exp(K_2\alpha))$, where $K_i$ are positive constants. (c) In batch Hebbian learning with a binary activation function, the generalization error falls exponentially with $\alpha$. (d) Perfect learning is obtained when $N$ is very large but finite, unlike in the continuous perceptron. Quantitatively, for a given $N$ and $L$, perfect learning is achieved for $\alpha_f \propto O(\sqrt{L\ln(LN)})$. (e) A crossover to the behavior of the generalization error in the presence of continuous weights occurs for $L > O(\sqrt{N})$.

The paper is organized as follows. In Section II the architectures and the dynamical rules are defined, as well as the continuous and discrete students. In Section III the order parameters are defined, and the relations between the overlaps of the teacher with the discrete/continuous students are derived analytically. In Section IV the dynamical evolution of the order parameters in the case of binary output is derived analytically and confirmed by simulations. Both the on-line scenario and Hebbian learning are examined. In Section V the case of large synaptic depth and the crossover to continuous weights is studied. In Section VI perfect learning in finite-$N$ systems is examined both analytically and numerically. Section VII is devoted to the analysis of the case of continuous output. Finally, in Section VIII the results are summarized and open questions are addressed.



II. THE MODEL 

A. The Architecture 

We investigate a teacher-student scenario where both networks are single-layer and feed-forward. The examples are generated by the so-called teacher, whose weights are known to be restricted to a well-defined discrete set of values. We define a synaptic depth $L$ and a set of digital values as follows:

$$W_i^T = \pm\frac{1}{L},\ \pm\frac{2}{L},\ \ldots,\ \pm 1. \tag{1}$$

In case the zero value is also allowed, the possible values of the weights are

$$W_i^T = 0,\ \pm\frac{1}{L},\ \pm\frac{2}{L},\ \ldots,\ \pm 1. \tag{2}$$

For the sake of simplicity, we present results in this paper only for the case that includes zero, Eq. (2). It is easy to generalize our results to the other case, Eq. (1).

The input patterns are chosen at random and independently of each other. In the following they are drawn from a Gaussian distribution with zero mean and unit variance. The size of the teacher, the student and the input is $N$. For any input $\boldsymbol{\xi}$ the so-called teacher generates an output, $S$, according to some rule $F$,

$$S = F\!\left(\sum_{i=1}^{N} W_i^T \xi_i\right). \tag{3}$$

In the following we will discuss both binary and continuous rules. The student knows the rule $F$ and the discrete set of values to which the teacher is confined. In addition, in an on-line learning scenario the student is given, at each time step $\mu$, the input $\boldsymbol{\xi}^\mu$ and the teacher's output $S^\mu$, whereas in batch learning the whole set $(\boldsymbol{\xi}^\mu, S^\mu)$, $\mu = 1, \ldots, \alpha N$, is given at once.
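As a rough illustration of this setup, the following sketch draws a teacher with equally probable discrete weights and generates labelled examples. It assumes the set of values $l/L$, $l = -L, \ldots, L$ (zero included) and a binary rule $F = \mathrm{sign}$; the function names `draw_teacher` and `make_examples` are hypothetical helpers introduced only for this example.

```python
import numpy as np

def draw_teacher(N, L, rng):
    """Draw N teacher weights uniformly from the 2L+1 values l/L, l = -L..L (assumed set)."""
    return rng.integers(-L, L + 1, size=N) / L

def make_examples(W_T, P, rng):
    """Generate P Gaussian inputs (zero mean, unit variance) and outputs S = sign(W_T . xi)."""
    xi = rng.standard_normal((P, W_T.size))
    S = np.sign(xi @ W_T)
    return xi, S

rng = np.random.default_rng(0)
W_T = draw_teacher(N=1000, L=2, rng=rng)      # values in {-1, -0.5, 0, 0.5, 1}
xi, S = make_examples(W_T, P=50, rng=rng)     # 50 examples, i.e. alpha = 0.05
```

In an on-line scenario the rows of `xi` would be presented one at a time, while in batch learning the whole array is used at once.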



B. Dynamics of the Weights 

A continuous precursor for the student, $\mathbf{J}$, is needed for learning from a discrete teacher. The learning procedure for a continuous student is well known. In an on-line scenario, at each step the continuous student updates its weight vector according to some learning algorithm $f$. The generic form of the learning algorithm is

$$\mathbf{J}^{\mu+1} = \mathbf{J}^{\mu} + \frac{\eta}{N}\, f(S^\mu, x_J^\mu)\, \boldsymbol{\xi}^\mu, \tag{4}$$

where $\eta$ is the learning rate and $x_J$ is the student's local field, $x_J = \mathbf{J}\cdot\boldsymbol{\xi}$. Such a learning algorithm means that at each learning step $\mu$ the current weight vector is updated according to the new example, $\boldsymbol{\xi}^\mu$, and each example is presented only once.

In an off-line scenario, there is a set of examples, $\mu = 1, \ldots, \alpha N$, and they are used all together to gain perfect learning. There are methods in which the off-line learning is made according to a rule that defines a quantity that is additive over all the examples. Hebb learning is an archetype of those methods,

$$\mathbf{J} \propto \sum_{\mu=1}^{\alpha N} S^\mu \boldsymbol{\xi}^\mu. \tag{5}$$

Such procedures were shown to end up in perfect learning. Since having a discrete teacher is merely a special case, not using the knowledge that the teacher is confined to a discrete set of values gives the well-known results: an exponential decay in the case of a continuous rule (on-line learning) and a power-law decay in the case of a binary rule (on-line and off-line learning).

The way to gain from the knowledge of the discrete nature of the weights is at the center of our work. It is based on having, in addition, a discrete student, $\mathbf{W}^S$, derived from the continuous one using the following clipping procedure. A continuous weight is clipped to the nearest discrete value among the $2L+1$ possibilities. Such a clipping procedure is the optimal one in the absence of any prior knowledge about the weights except that each value appears with the same probability. We define limit values, $\lambda_l$, which are arranged in increasing order. The limit values divide the continuous range of the components of the precursor weight vector into $2L+1$ intervals, according to the number of available values in Eq. (2). The clipping process is such that $J_i$ is mapped onto $l/L$ for $J_i \in (\lambda_l, \lambda_{l+1})$. The set of limits is $\{\lambda_{-L}, \lambda_{-L+1}, \ldots, \lambda_{-1}, \lambda_0, \lambda_1, \ldots, \lambda_{L+1}\}$. The mapping is given by the following rule:

$$W_i^S = \sum_{l=-L}^{L} \frac{l}{L}\left[\theta(\lambda_{l+1} - J_i) - \theta(\lambda_l - J_i)\right], \tag{6}$$

where $\theta$ is the Heaviside function.

Since the values of those limits, $\lambda_l$, are not obvious, we exemplify them with some specific cases. In the case of $L = 1$ in Eq. (1), for instance, it is obvious from symmetry that the limit between $-1$ and $1$ should be $0$. Hence, one introduces the limits $\lambda_{-1} = -\infty$, $\lambda_0 = 0$, $\lambda_1 = \infty$. Evaluating the mapping equation results in the well-known clipping rule, $W_i^S = \mathrm{sign}(J_i)$. Finding the appropriate values in all cases other than the Ising perceptron is more complicated: the continuous space is no longer divided into two clear regions, and hence one has to consider carefully the values of the limits.
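The clipping step can be sketched in a few lines. The example below assumes the including-zero set of values $l/L$, $l = -L, \ldots, L$, with the outermost limits at $\mp\infty$, so that only the $2L$ finite limits need to be supplied; `clip_weights` is a hypothetical helper name, not taken from the original work.

```python
import numpy as np

def clip_weights(J, inner_limits, L):
    """Map each J_i lying in (lambda_l, lambda_{l+1}) onto the discrete value l/L.

    inner_limits holds the 2L finite limits lambda_{-L+1} ... lambda_L in
    increasing order; lambda_{-L} = -inf and lambda_{L+1} = +inf are implicit.
    """
    J = np.asarray(J, dtype=float)
    # searchsorted counts how many finite limits lie below each J_i (0 .. 2L)
    idx = np.searchsorted(np.asarray(inner_limits, dtype=float), J)
    return (idx - L) / L

# Diluted Ising example (L = 1) with an illustrative choice lambda_1 = -lambda_0 = 0.5
print(clip_weights([-0.9, -0.1, 0.2, 0.7], [-0.5, 0.5], L=1))
```

With these limits the four continuous weights are mapped onto $-1$, $0$, $0$ and $1$, respectively.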

In this paper we chose to illustrate the general results by focusing on the case that includes zero with $L = 1$, i.e., $W_i = 0, \pm 1$. This case is known as the diluted Ising case, and some other aspects of it have been studied in the literature. It retains the simplicity of the Ising case on the one hand and introduces more generality concerning digital values on the other hand. In this case there is only one unknown parameter, $\lambda_1$, since $\lambda_2 = -\lambda_{-1} = \infty$ and $\lambda_0 = -\lambda_1$.

While choosing the values of the limits (in the last case this means only choosing the value of $\lambda_1$), one should take into consideration the a priori knowledge about the weights of the teacher. It is clear that the limits should scale with the student norm, since the exact set of values that the continuous student ends up with is irrelevant. The mapping rule ensures that the digital student ends up with the same values as those of the teacher. This will be shown only after analyzing the new order parameters and their dependence on the former ones, as presented in the next section.



III. THE ORDER PARAMETERS 

The agreement between teacher and student is evaluated by calculating either the generalization error or the order parameters. The generalization error, $\epsilon_g$, is calculated by averaging the student/teacher disagreement over the distribution of input vectors. The generalization error is given, in principle, by the overlaps between the vectors (the so-called order parameters). However, in order to get into details one first has to define the rule ($F$ in Eq. (3)). This will be done in the next sections. In the following we concentrate on introducing the complete set of order parameters and their interrelations.

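In our case there are three vectors and hence two interdependent sets of order parameters: one set concerns the continuous overlaps,

$$R_J = \frac{1}{N}\,\mathbf{W}^T\cdot\mathbf{J}, \qquad Q_J = \frac{1}{N}\,\mathbf{J}\cdot\mathbf{J}, \tag{7}$$

and one set concerns the overlaps of the digital vector,

$$R_W = \frac{1}{N}\,\mathbf{W}^T\cdot\mathbf{W}^S, \qquad Q_W = \frac{1}{N}\,\mathbf{W}^S\cdot\mathbf{W}^S. \tag{8}$$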



Note that the dynamical evolution of the continuous set of order parameters, Eq. (7), is independent of the clipped order parameters, since the training is done only according to the continuous weights. In contrast to the training process, the prediction, and hence the generalization error, follows the clipped student. Hence, finding the quantitative interplay between the continuous set of order parameters, Eq. (7), and the discrete set, Eq. (8), is the cornerstone of the analytical description of the generalization ability of the student.

In this section we examine the relation between the clipped set and the continuous one. The development of $R_J$ and $Q_J$ is not influenced by the clipping method. Hence, finding the above relation enables one to find the development of the clipped order parameters and gives the whole picture of the learning process.

The teacher's norm is determined by the a priori probabilities of each discrete value. Having equal probabilities and taking the thermodynamic limit results in the norm

$$T = \frac{1}{N}\,\mathbf{W}^T\cdot\mathbf{W}^T = \frac{2}{n_L}\sum_{l=1}^{L}\left(\frac{l}{L}\right)^2, \tag{9}$$

where $n_L$ is defined to be the number of optional values, $n_L = 2L + 1$. The order parameters of the clipped machine, $R_W$ and $Q_W$, are evaluated as functions of those of the continuous machine, $R_J$ and $Q_J$, as follows:

$$R_W = \langle W_i^T W_i^S \rangle, \qquad Q_W = \langle W_i^S W_i^S \rangle, \tag{10}$$

where $\langle A \rangle$ is an average over the known constraints and the known overlaps,

$$\langle A \rangle = \frac{\mathrm{Tr}_{\mathbf{W}^T}\int d\mathbf{J}\,\delta(\mathbf{J}^2 - NQ_J)\,\delta(\mathbf{J}\cdot\mathbf{W}^T - NR_J)\,A}{\mathrm{Tr}_{\mathbf{W}^T}\int d\mathbf{J}\,\delta(\mathbf{J}^2 - NQ_J)\,\delta(\mathbf{J}\cdot\mathbf{W}^T - NR_J)}. \tag{11}$$

The validity of this average is based on the assumption that all vectors $\mathbf{J}$ that are consistent with the constraints are taken with equal probability. This assumption is violated when the updating of the continuous vector is itself made according to the clipped one.

The results are:

$$R_W = \frac{1}{2n_L}\sum_{l,l'} \frac{l\,l'}{L^2}\left[\mathrm{erf}(\Phi_{l+1,l'}) - \mathrm{erf}(\Phi_{l,l'})\right],$$

$$Q_W = \frac{1}{2n_L}\sum_{l,l'} \frac{l^2}{L^2}\left[\mathrm{erf}(\Phi_{l+1,l'}) - \mathrm{erf}(\Phi_{l,l'})\right], \tag{12}$$

where the summation is over all the possible values, $l, l' = -L, -L+1, \ldots, L$, and we defined

$$\Phi_{l,l'} = \frac{\lambda_l/\sqrt{Q_J} - \rho_J\, l'/(L\sqrt{T})}{\sqrt{2(1-\rho_J^2)}}, \tag{13}$$

where $\rho_J = R_J/\sqrt{T Q_J}$ and $\rho_W = R_W/\sqrt{T Q_W}$ are the geometrical order parameters.

In the limit $L \to \infty$ the summation in Eq. (12) can be replaced by an integral. Calculating the integrals in this limit results in the obvious identities $R_W = R_J$ and $Q_W = Q_J$. Note that replacing the summation by an integral imposes an inequality: the difference $\Phi_{l+1,l'} - \Phi_{l,l'}$ must tend to zero, which holds as long as $L \gg 1/\sqrt{1-\rho_J^2}$ (see Eq. (13)). Hence, when $L$ is very large, learning with the continuous student and learning with the clipped version give the same result as long as $1 - \rho_J^2 \gg 1/L^2$. This limit is discussed in Section V.
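The mapping from the continuous to the clipped order parameters can be checked numerically. The sketch below is written under stated assumptions rather than as the authors' implementation: teacher values are taken as $l'/L$ with equal probability, and, given a teacher value, each continuous weight is treated as Gaussian with mean $\rho_J\sqrt{Q_J/T}\,l'/L$ and variance $Q_J(1-\rho_J^2)$, which is what the uniform average over the constrained vectors implies. The names `teacher_norm` and `clipped_order_parameters` are hypothetical.

```python
import math

def teacher_norm(L):
    """Teacher norm T for equally probable values l/L, l = -L..L."""
    n_L = 2 * L + 1
    return 2.0 / n_L * sum((l / L) ** 2 for l in range(1, L + 1))

def clipped_order_parameters(rho_J, Q_J, inner_limits, L):
    """Return (R_W, Q_W, rho_W) for given continuous order parameters (rho_J < 1).

    inner_limits: the 2L finite limits lambda_{-L+1} ... lambda_L, increasing.
    """
    T = teacher_norm(L)
    n_L = 2 * L + 1
    lam = [-math.inf] + list(inner_limits) + [math.inf]   # lambda_{-L} ... lambda_{L+1}
    width = math.sqrt(2.0 * Q_J * (1.0 - rho_J ** 2))
    mean_scale = rho_J * math.sqrt(Q_J / T)               # equals R_J / T
    R_W = 0.0
    Q_W = 0.0
    for lp in range(-L, L + 1):                           # teacher value l'/L
        mean = mean_scale * lp / L
        for l in range(-L, L + 1):                        # clipped student value l/L
            lo, hi = lam[l + L], lam[l + L + 1]
            # probability that J_i falls in (lambda_l, lambda_{l+1})
            p = 0.5 * (math.erf((hi - mean) / width) - math.erf((lo - mean) / width))
            R_W += (l / L) * (lp / L) * p / n_L
            Q_W += (l / L) ** 2 * p / n_L
    rho_W = R_W / math.sqrt(T * Q_W)
    return R_W, Q_W, rho_W

# Diluted Ising example: L = 1, Q_J = T = 2/3, limits at -0.5 and 0.5
print(clipped_order_parameters(0.95, 2.0 / 3.0, [-0.5, 0.5], L=1))
```

Evaluating the function for $\rho_J$ close to one gives a quick check that the clipped overlap approaches one when the limits are chosen between the rescaled teacher values.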

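We exemplify the general results in the case of the diluted Ising perceptron. In that case we use the following limits,

$$\lambda_2 = -\lambda_{-1} = \infty, \qquad \lambda_1 = -\lambda_0, \tag{14}$$

and the teacher's norm is $T = 2/3$. The mapping above gives

$$R_W = \frac{1}{3}\left[\mathrm{erf}(A_+) + \mathrm{erf}(A_-)\right],$$

$$Q_W = 1 - \frac{1}{3}\mathrm{erf}(A_0) + \frac{1}{3}\mathrm{erf}(A_-) - \frac{1}{3}\mathrm{erf}(A_+), \tag{15}$$

where $A_\pm = \left(\rho_J/\sqrt{T} \pm \lambda_1/\sqrt{Q_J}\right)/\sqrt{2(1-\rho_J^2)}$ and $A_0 = \lambda_1/\sqrt{2Q_J(1-\rho_J^2)}$.

From Eq. (15) one can verify that in the limit $\alpha \to \infty$, when the continuous order parameters reach perfect learning, $\rho_J \to 1$, the discrete order parameters reach perfect learning as well, $R_W \to 2/3$, $Q_W \to 2/3$ and $\rho_W \to 1$, provided that the positive quantity $\lambda_1$ satisfies $\lambda_1 < \sqrt{Q_J/T}$.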

In general, in order that the digital student gains perfect learning, it is necessary that the relation $\sqrt{Q_J/T}\,(l-1)/L < \lambda_l < \sqrt{Q_J/T}\,l/L$ holds for any positive $l$. Note that the interpretation of the above constraint is that in the vicinity of perfect learning the precursor might be concentrated around any set of discrete symmetric values, not necessarily the ones that the clipped student has.

One of the conclusions concerning $\lambda_l$ is that the law according to which $\epsilon_g$ decays is independent of the exact values of the limits $\lambda_l$. It depends only on the rule (binary/continuous), the specific learning strategy (on-line/off-line) and the learning algorithm one uses. In the following we analyze all these variations.



IV. BINARY OUTPUT 
A. On-line Learning 

In an on-line learning scenario one can write equations of motion that determine the development of the order parameters as a function of $\alpha$. The rate of convergence depends on the rule $F$ (Eq. (3)) and on the learning algorithm $f$ that one uses (Eq. (4)). Fine tuning is made by choosing the learning rate, $\eta$.

We analyze the learning procedure in the case of a binary rule,

$$S = \mathrm{sign}(x), \tag{16}$$

where $x$ is the local field. The generalization error as a function of $\rho$ is known to be

$$\epsilon_g = \frac{1}{\pi}\cos^{-1}(\rho). \tag{17}$$



Although it was shown that the "expected stability" algorithm, which maximizes the generalization gain per example, leads to an upper bound for the generalization ability, we choose to concentrate on the so-called AdaTron or relaxation learning algorithm. For zero stability, $\kappa = 0$, this latter algorithm performs comparably well and, unlike the "expected stability" algorithm, does not require additional computations in the student network besides the updating of its weights; the analysis is simpler as well.

The convergence to perfect learning depends on the learning rate: if it is too large, perfect generalization becomes impossible. The transition from a learnable to an unlearnable situation occurs at $\eta_c$. In the following, in order to simplify the analysis, we choose a fixed learning rate, $\eta = 1$, which is below $\eta_c$ in all scenarios.

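We update the artificial continuous weight vector, $\mathbf{J}$. The updating is made as in Eq. (4), according to the following learning rule:

$$\mathbf{J}^{\mu+1} = \mathbf{J}^{\mu} - \frac{\eta}{N}\, x_J^\mu\, \boldsymbol{\xi}^\mu\, \theta(-x_J^\mu S^\mu). \tag{18}$$

The equations for the order parameters with $\eta = 1$ are

$$\frac{d\rho_J}{d\alpha} = \frac{1}{\pi}\left[\left(1 - \frac{\rho_J^2}{2}\right)\sqrt{1-\rho_J^2} - \frac{\rho_J}{2}\cos^{-1}(\rho_J)\right],$$

$$\frac{dQ_J}{d\alpha} = -\frac{Q_J}{\pi}\left[\cos^{-1}(\rho_J) - \rho_J\sqrt{1-\rho_J^2}\right]. \tag{19}$$

In the limit $\alpha \to \infty$ one obtains a power law that describes the convergence of $\rho_J$ and $Q_J$,

$$\rho_J \simeq 1 - 2\left(\frac{3\pi}{4}\right)^2\frac{1}{\alpha^2}, \qquad Q_J \simeq Q_\infty\left[1 + 2\left(\frac{3\pi}{4}\right)^2\frac{1}{\alpha^2}\right], \tag{20}$$

where $Q_\infty$ is the asymptotic value of the norm. Note that since we have a binary output unit, perfect learning is gained as soon as the angle between the vectors goes to zero, independent of the student's norm.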
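The power-law convergence of the continuous student can be checked by integrating the flow of $\rho_J$ numerically. The sketch below is an assumption-laden illustration: it takes the flow $d\rho_J/d\alpha = \frac{1}{\pi}\left[(1-\rho_J^2/2)\sqrt{1-\rho_J^2} - \frac{\rho_J}{2}\cos^{-1}\rho_J\right]$, which follows from the relaxation rule with $\eta = 1$ and zero stability for Gaussian inputs, and uses a plain fourth-order Runge-Kutta step. The function names are hypothetical.

```python
import math

def drho_dalpha(rho):
    """Assumed AdaTron flow of the overlap rho_J for eta = 1 and zero stability."""
    rho = min(rho, 1.0)
    s = math.sqrt(max(0.0, 1.0 - rho * rho))
    return ((1.0 - 0.5 * rho * rho) * s - 0.5 * rho * math.acos(rho)) / math.pi

def integrate_rho(alpha_max, d_alpha=0.01, rho0=0.0):
    """Integrate the flow from alpha = 0 to alpha_max with fixed-step RK4."""
    rho = rho0
    for _ in range(int(round(alpha_max / d_alpha))):
        k1 = drho_dalpha(rho)
        k2 = drho_dalpha(rho + 0.5 * d_alpha * k1)
        k3 = drho_dalpha(rho + 0.5 * d_alpha * k2)
        k4 = drho_dalpha(rho + d_alpha * k3)
        rho += d_alpha * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return rho

def generalization_error(rho):
    """Binary-output generalization error, arccos(rho) / pi."""
    return math.acos(min(rho, 1.0)) / math.pi

alpha = 400.0
rho = integrate_rho(alpha)
print(alpha * generalization_error(rho))   # should approach 3/2 for large alpha
```

Under this assumed flow the product $\alpha\,\epsilon_g$ tends to $3/2$, which corresponds to $1 - \rho_J \simeq 2(3\pi/4)^2/\alpha^2$ at large $\alpha$.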

The solution of Eq. (19) describes only the development of the overlaps of the continuous perceptron. The next step is mapping the continuous precursor to the clipped one. Since in the case of a binary rule the student's norm converges to some unknown value, one way of choosing $\lambda_l$ is simply "half the way" between the constrained values, i.e., $-\lambda_{-L} = \lambda_{L+1} = \infty$ and otherwise

$$\lambda_l = \frac{2l-1}{2L}\sqrt{\frac{Q_J}{T}}. \tag{21}$$

The development of the order parameter $\rho_J$ is independent of the norm $Q_J$. Using a set of limits that scales with $\sqrt{Q_J}$, Eq. (12) ends up in a $\rho_W$ that depends on $\rho_J$ but does not depend on $Q_J$. Hence, plugging Eq. (20) into it, one can find the asymptotic behavior of the generalization error, Eq. (17), in the limit $\alpha \to \infty$,

$$\epsilon_g \propto \exp\left(-K(\lambda)\,\alpha^2\right), \tag{22}$$

where $K(\lambda)$ is set by the minimal distance between a limit and the nearest rescaled teacher value, $\min_l |\lambda_l - (l/L)\sqrt{Q_J/T}|$.

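We exemplify the aforementioned discussion in the diluted Ising perceptron. We use the limits as in Eq. (14) and assume $\lambda_1 = c\sqrt{Q_J/T}$. In that case

$$\rho_W = \frac{\mathrm{erf}(a_+) + \mathrm{erf}(a_-)}{\sqrt{T}\,\sqrt{9 - 3\,\mathrm{erf}(a_0) - 3\,\mathrm{erf}(a_+) + 3\,\mathrm{erf}(a_-)}}, \tag{23}$$

where $a_\pm = (\rho_J \pm c)/\sqrt{2T(1-\rho_J^2)}$ and $a_0 = c/\sqrt{2T(1-\rho_J^2)}$. In the limit of large $\alpha$ one finds

$$1 - \rho_W \propto \exp\left(-b_c\,\alpha^2\right), \tag{24}$$

where for $c > 1/2$, $b_c = (1-c)^2/(3\pi^2)$, and otherwise $b_c = c^2/(3\pi^2)$.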

One can see that choosing $c = 1/2$ results in the fastest decay of the generalization error.

FIG. 1. Simulation results of $\ln(\epsilon_g)$ of the continuous precursor (o) and of the clipped vector vs. $\alpha^2$. The clipping is made according to the mapping in Eq. (14), where the results are for $\lambda_1 = 0.5\sqrt{Q_J/T}$ (▽) and $\lambda_1 = 0.3\sqrt{Q_J/T}$ (△); error bars are smaller than the symbols. Solid lines are the numerical integrals (Eq. (19)). $\rho_x$ refers to the point at which a transition occurs between superior performance of the continuous and of the clipped perceptron (see text).

The analytical results are compared with simulations on a teacher of the diluted Ising type with the following parameters: $\lambda_1 = 0.5\sqrt{Q_J/T}$ and $\lambda_1 = 0.3\sqrt{Q_J/T}$, see Figure 1. The initial conditions for the continuous student weight vector are $Q_J(\alpha = 0) = T = 2/3$ and $R_J(\alpha = 0) = 0$. The weight components were drawn from a Gaussian distribution. We used $\eta = 1$, $N = 3000$, and each point was averaged over 50 samples. One can see in Figure 1 that the analytical results given by Eq. (19) and Eq. (23) are in agreement with simulations.
One can see that the super-exponential decay is independent of the exact value of $\lambda$. However, two important parameters do depend on the exact choice of $\lambda$. One is the decay rate, the factor $K(\lambda)$ in the large-$\alpha$ limit. One can see, for instance, that the optimal limit, $\lambda_1 = 0.5\sqrt{Q_J/T}$, results in a faster decay than the limit $\lambda_1 = 0.3\sqrt{Q_J/T}$. The second is the exact $\alpha$, or the exact value of $\rho_J$, at which the clipped version gives a better result than the continuous one. We denote this value by $\rho_x$. For $\rho_J < \rho_x$ the clipping lowers the overlap, since the learned solution does not contain enough information about the real direction of the teacher, $\mathbf{W}^T$, so that clipping only makes the solution forget a little of what was learned without bringing it closer to the exact solution. In the other region, $\rho_J > \rho_x$, clipping becomes efficient because the learned solution is near the exact one. The numerical results for $\rho_x$ according to the mapping of $\rho_J$ onto $\rho_W$ are $\rho_x \simeq 0.92$ for $\lambda_1 = 0.5\sqrt{Q_J/T}$ and $\rho_x \simeq 0.97$ for $\lambda_1 = 0.3\sqrt{Q_J/T}$, see Figure 1.

B. Clipped-Hebbian Learning

The Ising perceptron, the diluted Ising perceptron and all binary units that are confined to discrete values exhibit a phase transition. This known result has been hard to achieve with a practical algorithm. One way to gain perfect learning is to include the information of all the patterns simultaneously in the weights by using the Hebb learning procedure, Eq. (5). Such learning will end up in a discrete student only in the limit $\alpha \to \infty$. The decay of the generalization error in that case is known, since it is exactly the same as when having a continuous teacher, $\epsilon_g \propto 1/\sqrt{\alpha}$. In this way the knowledge of having digital values is not used; one has a continuous student that happens to realize, after learning, that the values are constrained to a finite depth.

FIG. 2. Analytical results for $\rho_x$ as a function of the limit ($\lambda_1$) in the diluted Ising case. $\rho_x$ stands for the value of the continuous overlap below/above which a better generalization is achieved by the continuous/clipped perceptron.

The above-mentioned procedure describes a way of using two vectors: a continuous one, which is evaluated according to the Hebb rule, and a discrete student, obtained by clipping the continuous precursor according to Eq. (6). The latter mapping results in a better generalization error for large enough $\alpha$.
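The two-vector procedure can be illustrated with a short batch experiment. The sketch below is a hedged example, not the authors' code: it assumes a diluted Ising teacher, builds the Hebb vector as the sum of $S^\mu \boldsymbol{\xi}^\mu$ over all examples, rescales it so that $Q_J = T$, and clips it with a limit $\lambda_1 = c\sqrt{Q_J/T}$. The function name `clipped_hebb` and the parameter values are hypothetical choices for illustration.

```python
import numpy as np

def clipped_hebb(N=200, alpha=30.0, c=0.5, seed=0):
    """Batch Hebb learning of a diluted Ising teacher, followed by clipping.

    Returns the overlaps (rho_J, rho_W) of the Hebb vector and of its clipped version.
    """
    rng = np.random.default_rng(seed)
    T = 2.0 / 3.0                                      # teacher norm for values 0, +-1
    W_T = rng.integers(-1, 2, size=N).astype(float)    # teacher weights in {-1, 0, 1}
    P = int(alpha * N)                                 # number of examples
    xi = rng.standard_normal((P, N))
    S = np.sign(xi @ W_T)                              # binary teacher outputs
    J = S @ xi                                         # Hebb vector: sum of S^mu xi^mu
    J *= np.sqrt(T * N / (J @ J))                      # normalise so that Q_J = T
    lam1 = c * np.sqrt((J @ J / N) / T)                # clipping limit
    W_S = np.sign(J) * (np.abs(J) > lam1)              # clipped student
    rho_J = (W_T @ J) / np.sqrt((W_T @ W_T) * (J @ J))
    rho_W = (W_T @ W_S) / np.sqrt((W_T @ W_T) * (W_S @ W_S))
    return rho_J, rho_W

print(clipped_hebb())
```

Repeating the run for smaller values of `alpha` gives a rough picture of where clipping stops being beneficial.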

We take as an example the diluted Ising case. Given that the continuous student is normalized to be the same as the teacher, $Q_J = T = 2/3$, one can find the exact point, $\rho_x$, at which the clipped method results in a better generalization. This value depends on the limit one chooses. One can see that the limit $\lambda_1$ that results in a better generalization of the clipped version at the smallest $\rho$ is $\lambda_1 \simeq 0.43\sqrt{Q_J/T}$, see Figure 2. One might have anticipated $\lambda_1 \simeq 0.5\sqrt{Q_J/T}$, which was found to optimize the decay as $\alpha \to \infty$. However, the above value is determined by the distribution of the continuous weights at the beginning of the learning process, at small $\alpha$. In this regime the distribution of the weights is close to a Gaussian, and its tail influences the value of $\lambda_1$. The analysis above indicates that choosing $\alpha$-dependent limits, $\lambda(\alpha)$, in this specific case might produce an even better generalization curve.

To conclude, the benefit from the clipping is evident only after the Hebb solution is near the exact one, i.e., after reaching a large $\rho$. To optimize the learning time, the limits should be chosen cautiously. If the aim of the learning is to minimize the generalization error at the very end of the procedure, after a long learning process, then the best choice for the limit is the "half the way" method, Eq. (21). However, to minimize the generalization error for a given finite $\alpha$, the best value might be around $\lambda_1 \simeq 0.425\sqrt{Q_J/T}$. These results suggest that it is possible to optimize the generalization error of the clipped perceptron by choosing a dynamical limit, $\lambda_1 = \lambda_1(\alpha)$.

V. LARGE SYNAPTIC DEPTH 

In this section we examine the crossover of the generalization error to its behavior in the presence of continuous weights as we increase the synaptic depth. As long as the synaptic depth is $L < O(\sqrt{N})$, the generalization error still vanishes super-exponentially, Eq. (22), where the pre-factor decreases with $L$. For $L > O(\sqrt{N})$ the learning is characterized by the features of spherically constrained learning.

A first step towards the continuous case limit is to find 
out the change of the decay of the generalization error 
as a function of L. We focus on the binary unit in the 
on-line scenario. The analytic tractability of this model 
enables a profound study of the influence of the synaptic 
depth over the learning features. 

In the last model the generalization error decays super-exponentially, $\epsilon_g \sim \exp(-K\alpha^2)$ (see Eq. (22)). The factor $K$ depends on the limits one chooses, $\lambda_l$. Hence, in order to remain consistent, we use the above-mentioned limits, Eq. (21), for the different depths. We should emphasize at this stage that only one out of many super-exponential terms that arise from the asymptotic expansion of all the error functions (Eq. (12)) was kept (Eq. (22)). As soon as the differences between the different factors in the exponent become too small, one has to integrate all the terms together instead of neglecting all but one. Such a procedure results in a different type of decay, a power law instead of a super-exponential decay.

Analytical and simulation results for the generalization error at various synaptic depths are presented in Figure 3. Simulations were carried out with $N = 630$ and each point is averaged over 100 samples. The inset shows the estimated slope $K$ (Eq. (22)) as a function of the depth $L$. One can see that $K$ is linear in $1/L$. The deviation from the analytically predicted relation for large $\alpha$, $K \propto 1/L$, is probably due to finite-$N$ effects.

FIG. 3. Simulation results of $\sqrt{-\ln(\epsilon_g)}$ in the case of $L = 1$ (diluted Ising) (x), $L = 2$ (▽), $L = 3$ (△) and $L = 157$ (O) versus $\alpha$. The analytical result obtained by the numerical integration of Eq. (19) and Eq. (12) is presented for the Ising case (solid line). The dashed line is the analytical curve for $-\ln(\epsilon_g^c)$, where $\epsilon_g^c$ is the generalization error of the continuous student. Inset: The dependence of the prefactor $K(L)$ in Eq. (22) on the depth $L$. The solid line is the least-squares fit, $K = 0.06/L$.

In the following we present an argument supporting the statement that the generalization performance of finite-depth machines coincides with the performance of continuous machines as soon as $L \gg \sqrt{N}$. This scaling is found by taking into account that: (a) The difference between two available values is of order $1/L$. (b) The distribution of the continuous student's values around the teacher's ones is a Gaussian with a width of order $\sqrt{1-\rho_J^2} \sim \epsilon_g^c$, where $\epsilon_g^c$ is the generalization error of the continuous student. A learning procedure (in the continuous space) in a finite dimension results in a generalization error $\epsilon_g^c$ that differs from the analytical prediction. The variance is of order $\sqrt{1/N}$. Hence, an estimate of the order of the lowest value reached in a specific run is $\sqrt{1/N}$. As a consequence, a discrete machine of depth $L$ with

$$\frac{1}{L} \ll \frac{1}{\sqrt{N}}, \tag{25}$$

or $L \gg \sqrt{N}$, gives the same results as those of continuous learning. Note that Eq. (25) is consistent with the mathematical constraint that was pointed out in Section III when we discussed the continuous limit. The simulations show indeed that in the case of $L = 157 \gg \sqrt{N}$, where $N = 630$, the performance of the discrete vector coincides with the analytical learning curve of the continuous student.

It is worth pointing out that a similar result was found when analyzing the possibility of learning from a discrete teacher by a discrete student using a general updating rule. That analysis uses a totally different argument and results in the conclusion that only when the teacher's depth is of order $\sqrt{N}$ is it possible to learn the rule using an updating rule that depends on the discrete weights, i.e., only then does it behave as if we had a continuous machine.

VI. FINITE SYSTEMS - PERFECT LEARNING 

The theoretical results presented in the previous sections describe the typical behavior of the generalization error and the order parameters. The main result is the fast decay to zero of the generalization error of the clipped perceptron, Eq. (22). In the case of a teacher and a student with continuous weights and finite $N$, the generalization error always remains at a finite distance from zero, even in the asymptotic stage of the learning process. In contrast to the continuous case, the learning of a perceptron with discrete weights and finite $N$ is characterized by a transition to perfect learning, as was found for the Ising perceptron. Simulations in that case reach perfect learning at some stage, since in the clipped version the student knows exactly the teacher's optional values. Hence, the overlap becomes exactly one, $\rho_W = 1$, and the generalization error becomes exactly zero as well, $\epsilon_g = 0$.

In order to estimate the number of steps needed to reach
perfect learning, $\alpha_f$, we use the following
approximation, valid in the $\alpha \to \infty$ regime, where
we can give an analytical approximation to the interdependence
of $\rho_W$ and $\alpha$. In addition, the minimal step
before perfect learning is well defined: $\rho_W = 1 - 2/(LN)$
or $\epsilon_g \sim \sqrt{1/(LN)}$. Hence, we can find the
interplay between $\alpha$ and $N$.

FIG. 4. Simulation results of $\alpha_f$, the number of rescaled
steps necessary to achieve perfect learning, vs. $\sqrt{\ln N}$.
Simulations for the diluted Ising perceptron, in the case of a
binary output unit, with $\lambda_1 = 0.4\sqrt{Q_J/T}$ (▽),
$\lambda_1 = 0.5\sqrt{Q_J/T}$ (△) and $\lambda_1 = 0.6\sqrt{Q_J/T}$ (○).
Solid lines correspond to least-squares linear fits. Inset:
simulation results of $\alpha_f$ vs. $\sqrt{L \ln L}$ for $N = 630$,
$L = 2, 3, 4, 7$; the limit values are chosen according to Eq.
Solid line is a least-squares fit.

In the binary output perceptron the generalization error
falls off super-exponentially, Eq. 22. Hence, the perfect
learning is determined by

$$\exp\{-K(\lambda, L)\,\alpha_f^2\} \sim \sqrt{1/(LN)}, \qquad (26)$$



and since we found in the last chapter that $K$ decays
linearly with $1/L$ we can derive $\alpha_f$ from the last
equation, $\alpha_f \sim \sqrt{L \ln(LN)}$. This result
indicates quantitatively that for any chosen limit,
$\lambda_l$, the number of learning steps necessary to
achieve perfect learning is finite as long as $N$ and $L$
are finite.
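As a rough numerical illustration, Eq. 26 can be solved for $\alpha_f$ in closed form. The sketch below assumes $K = k_0/L$ with a hypothetical constant `k0` (the text only states that $K$ decays linearly with $1/L$) and treats the scaling relation as an equality, so prefactors should not be taken literally.

```python
import math

def alpha_f(N, L, k0=1.0):
    """Solve exp(-K * alpha**2) = sqrt(1 / (L * N)) for alpha.

    Assumes K = k0 / L, where k0 is a hypothetical constant.
    """
    K = k0 / L
    return math.sqrt(math.log(L * N) / (2.0 * K))

for N in (30, 630, 9000):
    print(N, round(alpha_f(N, L=2), 3))
```

For fixed `L` the result grows like the square root of the logarithm of `N`, which is why it stays finite for any finite system.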

Figure 4 presents results of $\alpha_f$ obtained in simulations
for the diluted Ising perceptron with $c = 0.4$, $c = 0.5$,
and $c = 0.6$. Results were averaged over $M(N)$ training
sets, where the values of $M(N)$ range from 5000 to 20 in
accordance with $N$, which is varied between 30 and 9000.
To get results in lower dimension, $N$, we averaged over a
larger number of simulations, $M$.

One can see from the obtained values of $\alpha_f(N, c)$ in
Figure 4 that this quantity is indeed linear in
$\sqrt{\ln N}$. Note that the obtained slope in Figure 4 for
$c = 0.4$ and $c = 0.6$ is the same, as expected since the
decay constant is symmetric around $c = 1/2$. In the inset,
one can see that $\alpha_f(L)$, in the case of $N = 630$,
indeed increases linearly with $\sqrt{L \ln L}$. As
$L \to \infty$ an infinite number of examples is needed for
perfect learning; there is a crossover to the spherical
case, as was discussed in the previous chapter.

Small deviations from a straight line in Figure 4 are
expected to be a consequence of the following approximations:
(a) We took as the analytical curve only the asymptotic
function, which is an expansion valid for $\alpha \to \infty$.
(b) We neglected the polynomial corrections to this
asymptotic expression, such as $1/\sqrt{\alpha}$. (c) We
derived the asymptotic expression from the analytical
calculation of $\rho_J(\alpha)$. The latter quantity itself
is influenced by finite size effects. Extensive numerical
simulations show that the corrections are linear in $1/N$,
and hence they are negligible after clipping and getting
$\rho_W$.

As was shown in previous chapters, $c = 0.5$ gives the
best performance in the asymptotic learning procedure, i.e.
a lower $\alpha_f$ for all $N$, and this is confirmed in our
simulations, Figure 4. In the thermodynamic limit
$N \to \infty$, $\alpha_f \to \infty$ as expected.



VII. CONTINUOUS UNIT 

We now study the case of continuous output perceptrons
with finite depth. As long as one uses a continuous
activation function, the generalization error decreases
exponentially. In order to learn a rule which is defined by
a finite depth vector, we used a spherical vector for the
student weight vector, $\mathbf{J}$, and clipped it in order
to have a digital student weight vector, $\mathbf{W}$. The
updating of the spherical student weight vector is done
according to the gradient descent method as usual:



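$$\mathbf{J}^{\mu+1} = \mathbf{J}^{\mu} - \eta\,\nabla_{\mathbf{J}}\,\epsilon(\mathbf{J}^{\mu}, \boldsymbol{\xi}^{\mu}). \qquad (27)$$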



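The error $\epsilon(\mathbf{J}^{\mu}, \boldsymbol{\xi}^{\mu})$
measures the deviation of the student from the teacher's
output for a particular input $\boldsymbol{\xi}^{\mu}$.
The generalization error of a student is defined as the
averaged error

$$\epsilon_g(\mathbf{J}) = \langle \epsilon(\mathbf{J}, \boldsymbol{\xi}) \rangle_{\boldsymbol{\xi}}. \qquad (28)$$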



Since the learning features of all kinds of continuous
transfer functions are more or less the same, we chose
to concentrate on the "sin" activation function

$$S = \sin(kx). \qquad (29)$$

The periodic activation function, sin, was found to be
learnable given that the period $k$ is small enough. In
the following we simplify our analysis by taking $k = 1$
and the learning rate $\eta = 1$. Since the learning curves
of the continuous version are the same as if there were a
rule defined by a continuous teacher (the finite depth
limitation is merely a special case of the spherical
constraint), and the learning rate we chose is small enough,
we find that perfect learning is an attractive fixed point
in both scenarios.
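The procedure described above (on-line gradient descent on a continuous student, followed by clipping) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the simulation code behind the figures: the per-example error is taken to be 0.5 (sin x - sin y)^2, the teacher is a random diluted Ising vector rescaled to give a unit-variance field, inputs are Gaussian, and the function name, system size and starting point are hypothetical choices.

```python
import numpy as np

def train_and_clip(N=200, alpha_max=20.0, eta=1.0, lam=0.5, seed=0):
    rng = np.random.default_rng(seed)
    teacher = rng.integers(-1, 2, size=N).astype(float)  # diluted Ising teacher
    T = teacher @ teacher / N            # teacher norm, close to 2/3
    B = teacher / np.sqrt(T)             # rescaled teacher, unit-variance field
    J = np.zeros(N)                      # continuous (precursor) student

    for _ in range(int(alpha_max * N)):  # one fresh random example per step
        xi = rng.standard_normal(N)
        x = J @ xi / np.sqrt(N)          # student local field
        y = B @ xi / np.sqrt(N)          # teacher local field
        # gradient of 0.5 * (sin x - sin y)**2 with respect to J (k = 1)
        J -= eta * (np.sin(x) - np.sin(y)) * np.cos(x) * xi / np.sqrt(N)

    Q_J = J @ J / N
    limit = lam * np.sqrt(Q_J / T)       # clipping limit lam * sqrt(Q_J / T)
    W = np.sign(J) * (np.abs(J) > limit) # clipped student in {-1, 0, +1}

    def overlap(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    return overlap(J, teacher), overlap(W, teacher), W

rho_J, rho_W, W = train_and_clip()
print(f"rho_J = {rho_J:.4f}, rho_W = {rho_W:.4f}")
```

Only the continuous vector `J` is ever updated; the clipped vector `W` is derived from it when a prediction is needed, which is the division of labor between the two students.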

Linearizing the equations of motion around those fixed 
points results in the following form (which holds for all 
the continuous transfer functions): 
$$R_J = 1 + \frac{c_1}{\det V}\,V_{22}\,e^{\gamma_1 \alpha} - \frac{c_2}{\det V}\,V_{12}\,e^{\gamma_2 \alpha},$$
$$Q_J = 1 - \frac{c_1}{\det V}\,V_{21}\,e^{\gamma_1 \alpha} + \frac{c_2}{\det V}\,V_{11}\,e^{\gamma_2 \alpha}. \qquad (30)$$



The two eigenvalues of $V$, $\gamma_1$ and $\gamma_2$, are both
negative. The constants $c_1$, $c_2$ are determined from the
numerical solution of the equations of motion.
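The structure of the linearized solution, a sum of two decaying exponentials with constants fixed by the initial deviation, can be sketched with a generic eigen-decomposition. The matrix `A` and the initial deviation `delta0` below are hypothetical numbers chosen only for illustration.

```python
import numpy as np

# Illustrative Jacobian of the linearized flow (hypothetical numbers).
A = np.array([[-0.70, 0.20],
              [-0.02, -0.29]])

gammas, U = np.linalg.eig(A)        # eigenvalues and eigenvectors (columns of U)
delta0 = np.array([-0.05, -0.02])   # assumed initial deviation (R_J - 1, Q_J - 1)
c = np.linalg.solve(U, delta0)      # expansion coefficients, one per eigenvalue

def deviation(alpha):
    """(R_J - 1, Q_J - 1) at rescaled time alpha, a sum of two exponentials."""
    return (U * np.exp(gammas * alpha)) @ c

print(gammas, deviation(5.0))
```

At large `alpha` the term with the eigenvalue closest to zero dominates, so the slower of the two rates controls the asymptotic approach to the fixed point.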

In order to get a description of the discrete learning
one has to use the mapping relations. The generalization
error of the finite depth student depends directly on the
order parameters, as can be found by taking the average over
the local fields distribution, Eq. 28. The general result of
this calculation in the $\alpha \to \infty$ regime is

$$\epsilon_g \sim \exp\left(-C_0\, e^{|K| \alpha}\right), \qquad (31)$$

where $K$ and $C_0$ depend only on the learning rate, $\eta$,
the limits one chose, $\lambda_l$, and the specific activation
function. In the following we examine this result in the
diluted Ising case.




FIG. 5. Simulation results of $\rho_J$ (△) and $\rho_W$ (○) vs. $\alpha$
in the diluted Ising case. Solid lines are the numerical integrals
(Eqs. 1, 32). Inset: $\ln(-\ln \epsilon_g)$ vs. $\alpha$ obtained in
simulations (○) with $N = 3000$. Solid line is a least-squares
linear fit; the slope was found to be 0.33.



We performed simulations in the diluted Ising case,
when the transfer function is sin. The development of
the continuous order parameters in that case is described
by the following equations of motion,

$$\frac{dR_J}{d\alpha} = \frac{1}{2}\left[(R_J+1)D_+ - 2R_J e^{-2Q_J} - (R_J-1)D_-\right],$$
$$\frac{dQ_J}{d\alpha} = \left[(R_J+Q_J)D_+ - 2Q_J e^{-2Q_J} - (Q_J-R_J)D_-\right] + \frac{1}{4}\left[\frac{3}{2} + e^{-2Q_J} - e^{-2} + D_+ - D_- + E_+ - E_- - \frac{1}{2}\left(e^{-8Q_J} + D_+^4 + D_-^4\right)\right], \qquad (32)$$

with $D_\pm = e^{-(1+Q_J \pm 2R_J)/2}$ and $E_\pm = e^{-(1+9Q_J \pm 6R_J)/2}$.
As $\alpha \to \infty$, one gets two eigenvalues,
$\gamma_1 \simeq -0.30$, $\gamma_2 \simeq -0.69$. We use Eq. 15,
rescale $R_W$ and $Q_W$ by the teacher's norm, 2/3, and take
the limit value, $\lambda$, to be the one that yields the
faster decay in the large $\alpha$ regime,
$\lambda = 0.5\sqrt{Q_J/T}$. Collecting everything we have



$$1 - R_W \propto \exp(-0.15\alpha)\,\exp\left(-K^2 e^{0.30\alpha}\right), \qquad (33)$$



where $K$ is determined by the initial conditions. The
generalization error as a function of the discrete parameters
is

$$\epsilon_g = \frac{1}{2}\left[1 - d_- + d_+ - \frac{1}{2}\left(e^{-2Q_W} + e^{-2}\right)\right], \qquad (34)$$

with $d_\pm = e^{-(1+Q_W \pm 2R_W)/2}$. Expanding the last
equation around $R_W \to 1$ and $Q_W \to 1$, we obtain that
the generalization error decays very fast,
$\epsilon_g \sim \exp\left(-K^2 e^{0.30\alpha}\right)$.
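The two eigenvalues quoted above can be checked numerically. The sketch below writes the drift of the continuous order parameters under the assumptions that the per-example error is 0.5 (sin x - sin y)^2, that $\eta = 1$ and $k = 1$, and that the teacher field has unit variance; the function names, the finite-difference Jacobian and the forward-Euler integrator with its starting point (0, 0) are illustrative choices.

```python
import numpy as np

def drift(R, Q):
    """Right-hand sides (dR_J/dalpha, dQ_J/dalpha) for eta = 1, k = 1."""
    Dp = np.exp(-(1 + Q + 2 * R) / 2)
    Dm = np.exp(-(1 + Q - 2 * R) / 2)
    Ep = np.exp(-(1 + 9 * Q + 6 * R) / 2)
    Em = np.exp(-(1 + 9 * Q - 6 * R) / 2)
    e2Q = np.exp(-2 * Q)
    dR = 0.5 * ((R + 1) * Dp - 2 * R * e2Q - (R - 1) * Dm)
    first = (R + Q) * Dp - 2 * Q * e2Q - (Q - R) * Dm
    second = 0.25 * (1.5 + e2Q - np.exp(-2.0) + Dp - Dm + Ep - Em
                     - 0.5 * (np.exp(-8 * Q) + Dp**4 + Dm**4))
    return np.array([dR, first + second])

def jacobian(R=1.0, Q=1.0, step=1e-6):
    """Central-difference Jacobian of the drift at (R, Q)."""
    dR_col = (drift(R + step, Q) - drift(R - step, Q)) / (2 * step)
    dQ_col = (drift(R, Q + step) - drift(R, Q - step)) / (2 * step)
    return np.column_stack([dR_col, dQ_col])

def integrate(alpha_max=30.0, d_alpha=0.01, R=0.0, Q=0.0):
    """Forward-Euler integration from an assumed initial condition."""
    for _ in range(int(alpha_max / d_alpha)):
        dR, dQ = drift(R, Q)
        R, Q = R + d_alpha * dR, Q + d_alpha * dQ
    return R, Q

gammas = np.sort(np.linalg.eigvals(jacobian()).real)
R_end, Q_end = integrate()
print("eigenvalues at the fixed point:", gammas)
print("order parameters at alpha = 30:", R_end, Q_end)
```

The drift vanishes at $R_J = Q_J = 1$, and the slower eigenvalue of the Jacobian there is the rate that enters the double-exponential decay of the clipped student.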

We ran simulations with $N = 3000$ and averaged over
10 samples. In Figure 5 the development of the discrete
as well as the continuous order parameters as a function
of $\alpha$ is presented. The solid lines are the numerical
integrals of Eq. 32. Note that the transition in this
scenario, from a poorer generalization of the clipped
version compared with that of the continuous one to a
situation in which the clipped version has the better
performance, occurs at the same $\rho_T \simeq 0.92$ as in
the binary unit. This quantity is related to the clipping
rule and is independent of the specific transfer function
one tries to learn.

The inset of Figure 5 shows the unique decay of the
generalization error; in order to get a straight line we
plotted $\ln(-\ln \epsilon_g)$ as a function of $\alpha$.
According to the above analysis the slope should be 0.30,
and we obtained $0.33 \pm 0.01$ in simulations. This is in
good agreement, considering that we are dealing with an
approximation which is valid only for $\alpha \to \infty$,
while the simulation results are at finite $\alpha$. The
generalization error of the clipped version for larger
$\alpha$ ($\alpha > 7$ in our case) gives better results than
those predicted by the analysis; its values are exactly zero
due to the finite size effects discussed in chapter VI.
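The slope extraction described for the inset can be sketched as a linear least-squares fit on a double-logarithmic scale. The data below are synthetic, generated from an assumed form exp(-C exp(0.30 alpha)) with a hypothetical constant C = 0.8; they are not simulation results.

```python
import numpy as np

def double_log_slope(alpha, eps_g):
    """Fit ln(-ln(eps_g)) = slope * alpha + intercept by least squares."""
    y = np.log(-np.log(eps_g))
    slope, intercept = np.polyfit(alpha, y, 1)
    return slope, intercept

# Synthetic data from an assumed double-exponential decay (C = 0.8).
alpha = np.linspace(2.0, 6.0, 9)
eps_g = np.exp(-0.8 * np.exp(0.30 * alpha))

slope, intercept = double_log_slope(alpha, eps_g)
print(slope, np.exp(intercept))
```

The transformation requires generalization errors strictly between zero and one, so points where the clipped student has already reached exactly zero error have to be left out of such a fit.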

Following the same arguments used to estimate the number of
examples needed for perfect learning, one finds that in the
case of continuous output $\alpha_f \sim \ln(\ln N)$. It is
obvious from the analytical calculations and the simulations
above that clipping a continuous vector in order to learn a
finite depth teacher results in extremely fast learning. The
learning in finite dimension is characterized by $\alpha_f$,
above which one gets perfect learning of the discrete vector.
All those unique characteristics of the discrete learning
disappear as soon as the weight depth is of order $\sqrt{N}$,
as was found in chapter V.
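The double-logarithmic scaling can be made explicit with a short worked derivation. It is a sketch that assumes the asymptotic form $\epsilon_g \sim \exp(-C e^{\gamma \alpha})$ with $\gamma = |\gamma_1| \simeq 0.30$ and a constant $C$ (a symbol introduced here only for illustration), combined with the minimal-step estimate $\epsilon_g \sim \sqrt{1/(LN)}$ used for the binary unit:

```latex
\begin{align}
\exp\left(-C\,e^{\gamma \alpha_f}\right) &\sim \sqrt{\frac{1}{LN}}, \\
C\,e^{\gamma \alpha_f} &\sim \tfrac{1}{2}\,\ln(LN), \\
\alpha_f &\sim \frac{1}{\gamma}\,\ln\!\left[\frac{\ln(LN)}{2C}\right]
\;\sim\; \frac{1}{\gamma}\,\ln(\ln N)
\quad \text{for fixed } L \text{ and large } N .
\end{align}
```

Compared with the $\sqrt{L \ln(LN)}$ growth found for the binary output, the dependence on $N$ is weaker by one further logarithm.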



VIII. CONCLUSIONS 

In this paper, we presented an analysis of the simplest 
neural network, the perceptron, that learns from exam- 
ples given by another perceptron, the teacher, which is 
confined to a discrete space. In fact, we used two stu- 
dents, a continuous precursor and its clipped version. 

We analyzed the new set of order parameters arising
from the clipping method. We discussed the issue of how
to clip and what set of limits, $\lambda_l$, is the best
choice. We found that it depends specifically on the kind of
optimization one imposes. We showed that after reaching some
overlap, $\rho_T$, a transition occurs and the clipped
version results in a better performance than the non-clipped
one. If one is interested in optimizing the learning in the
sense of getting a better performance as soon as possible,
then the limits that minimize $\rho_T$ are the ones needed.
However, if by optimizing one tries to get the fastest
possible decrease in the $\alpha \to \infty$ regime, then the
best choice is 'half the way', in between the values. As we
mentioned before, it is possible to have a dynamic set of
values that interpolates during the learning process between
both values. We left this issue out of the scope of this
paper.

As one can see from the definitions, it is only natural to
choose the continuous weight vector not to be one which is
constrained to a hypersphere but a vector which is
constrained to a hypercube. It was shown that in the case of
storing random patterns, pre-training a continuous student
whose weight vector is constrained to the volume of a
hypercube results in a better performance. It remains an
open question what quantitative benefit one can gain in a
learning procedure by using the cubical constraint, and
whether a learning strategy could be designed which fulfills
this constraint.

We studied the case of a very large $L$ and showed that a
scaling relation between $L$ and $N$ arises from the
analysis. For $L \sim O(\sqrt{N})$ the learning curve is the
one that is typical of the continuous case. However, it
should remain clear that learning is the same as having a
continuous student unless $\alpha \to \infty$,
$\rho_J \to 1$. In that regime the fast decay that
characterizes the clipped learning appears. All digital
computers actually correspond to such a situation, where all
available properties have a finite representation. The
machine uses some kind of clipping by rounding the numbers.
The differences, as predicted here, can be significant only
in the $\alpha \to \infty$ regime or for small depth.
Visualizing them is usually impossible since they are smaller
than the measurement scale.